VQ-based written language identification
Abstract
Humans can recognize different written languages by their grammars and vocabularies, whereas computers see everything as numbers. We present a computational algorithm for machine classification of written languages based on vector quantization (VQ). For a document in a given language, each word is converted to a sequence of numbers according to its characters, forming a vector of numerical values. The resulting collection of vectors is then represented by a codebook that contains a number of template vectors used for classification. The proposed method is more effective for machine learning than the n-gram based method, which has been widely used for written language identification. Experimental results on classifying a set of five closely related Roman-typed scripts show the promise of the proposed method.
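As a rough illustration of the pipeline the abstract describes (words to numeric vectors, one codebook per language, classification against the codebooks), here is a minimal sketch. The abstract gives no implementation details, so everything specific here is an assumption: the character coding (a=1 to z=26), the fixed vector length `DIM` with zero padding, the use of plain k-means to build the codebook, the minimum-average-distortion decision rule, and all function names.

```python
import random

DIM = 8  # assumed fixed vector length; the abstract does not state one


def word_to_vector(word, dim=DIM):
    """Map a word to a fixed-length vector of letter codes (a=1 .. z=26).

    Dropping non a-z characters, zero padding and truncation to `dim`
    are assumptions of this sketch, not details taken from the paper.
    """
    codes = [ord(c) - ord("a") + 1 for c in word.lower() if "a" <= c <= "z"]
    codes = codes[:dim]
    return codes + [0] * (dim - len(codes))


def text_to_vectors(text, dim=DIM):
    """Convert a document to one vector per word, skipping words with no letters."""
    vectors = [word_to_vector(w, dim) for w in text.split()]
    return [v for v in vectors if any(v)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size=4, iters=20, seed=0):
    """Build a codebook of template vectors with plain k-means (Lloyd iterations).

    The paper may use a different codebook design algorithm.
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, min(size, len(vectors)))]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def avg_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(text, codebooks):
    """Return the language whose codebook gives the lowest average distortion."""
    vectors = text_to_vectors(text)
    if not vectors:
        raise ValueError("text contains no usable words")
    return min(codebooks, key=lambda lang: avg_distortion(vectors, codebooks[lang]))
```

In use, one would train a codebook per language from sample documents, for example `codebooks = {"english": train_codebook(text_to_vectors(english_text)), ...}`, and then call `identify(unknown_text, codebooks)`. Realistic accuracy would depend on far larger training texts and codebooks than a toy example can show.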
Similar resources
Asthma in Iranian Schoolchildren: Comparison of ISAAC Video and Written Questionnaires
Background: The international study of asthma and allergies in childhood (ISAAC) is used to define the prevalence and severity of asthma in different regions. In this study we followed the performance of the ISAAC video and written questionnaires (VQ and WQ) to classify asthma in 13-14 yr-old schoolchildren. Methods: The present study was carried out on 3540 schoolchildren 13 to 14-yrs-old us...
Semantic-Based Image Retrieval in the VQ Compressed Domain using Image Annotation Statistical Models
Two-stage speaker identification system based on VQ and NBDGMM
In this paper, a new speaker identification system is presented. The system can be divided into two subsystems, one close-set speaker identification system and one speaker verification system. The VQ model is used in the close-set speaker identification system and a new method called NBDGMM (Normalization Based on Difference of GMM) is introduced. Experiments have been done to prove that this s...
VQ-based Bayesian Estimation for Blur Identification and Image Selection in Video Sequences
We address the problem of blur identification and image selection with statistical blur priors in the context of the vector quantization (VQ) based framework. Firstly, we assume some dominant blur priors for estimating point spread functions (PSFs) of blurred frames in Bayesian MAP estimation. The blurred frames with estimated PSFs can be stored in VQ-based multiple codebooks. These codebooks c...
Graph-Based N-gram Language Identification on Short Texts
Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for addressing this problem, but most of them assume relatively long and well written texts. We propose a graph-based N-gram approach for LI called LIGA which targets relatively short and ill-written texts. The results of our experimental study show that LIGA ...
Publication date: 2003